CS432 - Fall 2001
Assignment 5 - B+-Tree
Deadline: November 20, 23:59pm
Check this page and the newsgroup frequently for updates. The extra
credit part of A3 will be graded in a week.
FAQ
·
Question: It seems that clock() returns only the
processor time, not the overall time.
Answer: Use time() instead to get accurate timings.
·
Question: In reporting the statistics, ie. running
time and number of page misses, do we have to vary the number of tuples in each
relation besides varying the buffer pool size?
Answer: Use a large number of tuples to get a stable running time and repeat
your experiments a few times as stated below in under "statistics
collection". You do not have to vary the number of tuples.
·
Question: When we do the join, should we include both
of the join attributes or just one copy?
Answer: Use the function MakeNewRecord() to make the new record.
Introduction
For this assignment you are expected to implement three join algorithms
and evaluate their relative performance.
Background
Let us quickly review some of the modules you will need for this assignment.
You should already be familiar with them from the the previous assignments.
In Minibase, a relation is implemented as a heap file, which is a
collection of records. Records can be inserted or deleted from a heap file, and
each record is uniquely identified by a record id. A scan is the interface used
to access records in a heap file, one by one.
An index provides fast access to records in the heap file, and currently
Minibase only supports B+-indices. Each entry in the index is a (key, record
id) pair. Entries can be inserted or deleted from an index. An index search
scan provides interface for accessing the records in the index.
Four classes are provided for you : (i.e. you only need to call methods
in these classses)
Refer to the reference section for the details of these classes and how
to use them.
Statistics collection
Augment the buffer manager class given to you to collect statistics.
Specifically, you should be able to tell how many PinPage requests have been
made and how many of those miss. It is suggested that you write two functions
to reset and report the statistics.
You can use the time() function (which returns the current time) to
determine out the running time of your join algorithm. You will need to include
time.h. Refer to the C++ help for more details on time().
You should let your algorithm run for a few times (the longer the better)
and report the average running time. Avoid running other CPU or I/O intensive
processes (Word, Netscape, etc.) while collecting statistics since clock()
reports actual time rather than CPU time. Print out average running times and
the number of misses for every join algorithm you implement.
Tuple-at-a-time nested loop join.
Start with this one, it is the simplest to implement.
Block nested loop join
The algorithm for this join method can be found in your textbook. Since
Minibase does not provide page-by-page access into a relation stored in a heap
file, you will simulate a "block" with an array storing the records.
The join function will take as a parameter, an integer B (block size = array
size) of the outer relation. You are not required to implement the double
buffering techniques or use hash tables.
The pseudocode for this join is:
For each block B in R
For each tuple s in S
For each tuple r in B
Match r with s
if Match then
Insert (r,s) into the result relation
Compare the performance of "Block nested loop" join for various
block sizes.
Index nested loop join
Implement this algorithm based on the pseudo-code in the textbook.
First create an unclustered B+-tree index on the inner relation.
Determine whether it is beneficial to build a B+-tree index for the purpose of
performing a single join operation by comparing the cost of this join with that
of other join methods.
Performance comparison
Simplifications
Source code
The source code for the project is located in the directory \\goose\courses\cs432-fall01\a5\code.
The directory contains:
You should write your code in <join-method>.cpp and main.cpp.
Investigate the code provided to understand how the test program works.
Functions in join.cpp and relation.cpp will
be useful for writing your join methods and for debugging. The function SortFile()
is particularly useful as an example on how to use HeapFile, Scan, BTreeFile
and BTreeFileScan. To start working on the project, copy the directory to your
workspace, and double-click on a6.dsw.
Hints
Reference
Those in bold are classes and types that are particularly useful for this
assignment. You don't need to know the rest to complete the assignment, but if
you wish to study the code provided, they may be of help.
Coding convention
You need to follow certain coding convention for all your assignments:
Submission procedure
How to hand in: You are a group of two students. Drop only the following
five files into your handin directory on goose (\\goose\courses\cs432-fall01\handinA5).
So if students with netids "abc1" and "xyz2" form a
group together, one of the two handin directories (for example the directory \\goose\courses\cs432-fal01\handinA5\abc1)
will contain the five files for the group. Make sure that your program can
be recompiled as needed. You can submit the executable file, but we probably
won't look at it. Delete the empty handin directory of the other student.
Keep a copy of the project in your own account just in case.
Grading criteria
Your assignments will be graded according to the following criteria: